-
Notifications
You must be signed in to change notification settings - Fork 187
feat: clarify the expected behavior, and rationale, of the post join filter #807
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: clarify the expected behavior, and rationale, of the post join filter #807
Conversation
| | Post-Join Filter | A boolean condition to be applied to each result record after the inputs have been joined, yielding only the records that satisfied the condition. | Optional | | ||
| | Post-Join Filter | A boolean condition to be applied to each potential match between the left and right | ||
| inputs. If it evaluates to false then the potential match is not considered a match. A join relation with | ||
| Join Expression X and Post-Join Filter Y is equivalent to a join relation with Join Expression X AND Y. | Optional | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! Much better than my local draft! :)
Two more things.
- Align Hash/MergeJoin post-join filter description with this. We could refer
JoinRelthere and leave what's different. - Should this be
Optional, default Truelike hash/merge join?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Align Hash/MergeJoin post-join filter description with this. We could refer JoinRel there and leave what's different.
Can you expand on what you mean here? The PR does currently update the hash/merge join descriptions. I don't include the A join relation with Join Expression X and Post-Join Filter Y is equivalent to a join relation with Join Expression X AND Y statement because this is not true for hash/merge join (the join expression for these relations is a series of equality conditions).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant the language of the description. The way you describe is more explicit that the post_join_filter IS part of the join condition, say saying what "matches" and what "does not match". This is not for try to reduce the output.
Also, can we drop Equi form HashEquiJoin? :)
Is this strictly true? As in a consumer must resolve both expressions on the same inputs? If so, I think it'd be nice to add a comment in the .proto file to the effect of " |
| // The post-join filter is a filter that is applied to the result of the join before an output | ||
| // record is produced. If the filter evaluates to false then the record is not considered a | ||
| // match. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is ambiguous for functions that aggregate over many tuples. I think a "simple" example is:
- the
post_join_filteris a comparison (lte) that uses a window function (count) - the
expressionis a predicate with selectivity between 0 and 100% - the
expressionproduces many tuples from one input for a single tuple of the other input
(2) and (3) are necessary for ambiguous scenarios to occur and (1) is where the ambiguity is expressed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so, I think "applied to the result of the join before an output record is produced" lends itself to being misunderstood because the "result" of the join sounds like the result of applying expression, but I think to be accurate to "equivalent to a join relation with Join Expression X AND Y" you must evaluate post_join_filter on the inputs to expression even if you only evaluate its "truthyness" on joined records that expression evaluates as true.
Maybe something more like "applied to the inputs of the join, before an output record is produced" is better and equally concise?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree that it's better to clarify the predicates are evaluated over the inputs. Like @drin 's suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to "evaluated over the inputs". As for when it's applied, I'm still not too sure about what is the supposed behavior tbh. Let's say you're joining two tables a LEFT JOIN b ON ... with a post-join filter that has a.Col1 = b.Col2. Is a.Col1 = b.Col2 expression also supposed to follow join type semantics and leave the unmatched records from the left side in the output? Or will the result be as if it had been an inner join instead of a left join?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get very confused easily when talking about when this filter is applied. Here is my understanding, in naive pseudocode, of how it is applied. I'm omitting right joins, full outer joins, single joins, and mark joins for simplicity.
for left_record in left_records:
has_match = False
for right_record in right_records:
if join_expression(left_record, right_record) and post_join_filter(left_record, right_record):
has_match = True
if join_type == Inner or join_type == Left:
emit(combine(left_record, right_record):
if has_match and join_type == LeftSemi:
emit(left_record)
elif not has_match and join_type == LeftAnti:
emit(left_record)
elif not has_match and join_type == Left:
emit(combine(left_record, null))
If someone has an alternate proposal, is it possible to share your own pseudocode representation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tokoko the predicates in join condition does not follow join type. The join type becomes into play depending on whether there is a matching row (i.e., intersection) or not. outer joins and antisemi joins should ensure that you have no rows that matches according to JoinRel.expression AND JoinRel.post_join_filter to correctly behave (i.e., whether to produce null padded rows (outer) or include in the output (antisemi)).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense. For JoinRel, can't we write something like "post-join filter is supposed to be evaluated as if it's part of the join expression" or something similar? It would be a lot simpler to understand imho rather than thinking through when during the operation it's supposed to be applied/evaluated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tokoko that's why I initially proposed to drop post_join_filter from JoinRel in the slack discussion. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with that pseudocode weston. My intention is to make the wording clearly reflect that post_join_filter(left_record, right_record) is valid and post_join_filter(combine(left_record, right_record)) is invalid.
Note (for completeness) that my naive reading of "post join" was incorrect and would have been to implement:
# above pseudocode here
...
if post_join_filter(emitted_record):
really_emit(emitted_record)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup 'post_join' really tripped me and reason i started the thread. In the systems i worked, used residual join condition/predicate rather than post.
|
It's difficult to follow the threads in this discussion. One can think of a a join with a post join filter as a composite operation that is a join followed by a filter relation. It is entirely valid translation to take a post join filter out of the join and put in a filter relation directly afterwards and vice versa. The post join filter does not logically interact with the join type at all. The composite exists because many systems have it and it can be a beneficial physical pattern. The reason it has to be stated separately from the join predicate is to have covering behavior of all possible filter conditions. I always have to remind myself of which conditions can and cannot be moved into a join evaluation clause. I'm supportive of clarifying the text if people are unclear as to what post join filter means. |
This is not the conclusion we came to. I believe the content of the PR is still accurate with the threads, so you can just review the content and ignore the discussion. For example: As it stands this filter will emit one row for each row in From your GPT link this matches:
|
|
There is a thread that discusses the description that would go in the website: discussion on website description I think further discussion on this can be deferred until agreement on Then, there's a thread discussing the comment in the .proto file for This discussion assumes that Weston's assertion in the description is the correct semantics of
In this PR, we never collectively discussed what it should be versus what it is. The description says:
And I asked:
One thing that was referenced in slack is the substrait FAQ: "The post-join filter on the various Join relations is not always equivalent to an explicit Filter relation AFTER the Join." This FAQ then references velox hash-join implementation, which says: "Filter is optional. If specified it can be any expression over the results of the join." It occurs to me that the FAQ says "post-join filter... is not always equivalent to an explicit Filter relation AFTER the join," yet the referenced velox documentation says "If specified, it can be any expression over the results of the join." These seem directly contradictory to me, since "the results of of the join" sounds quite a bit like "AFTER the join". |
The post join filter has very little explanation. It can also be confusing because, from a purely logical perspective, it is possible to see the post join filter as redundant. This PR attempts to clarify the description of the post join filter.